Abstract

This paper presents the GARCON program, illustrating its functionality on a simple HEP analysis example.
The program automatically performs rectangular cut optimization and verification of stability
in a multi-dimensional phase space. The program has been successfully used by a number of very
different analyses presented in the CMS Physics Technical Design Report. The current version, GARCON 2.0,
incorporates the feedback the authors have received. A User's Manual is included as a part of
the note.

1 Introduction



Genetic algorithm (GA) definitions along with some review information are given in Ref . . In short, GA is a 
set of algorithms inspired by concepts of natural selection, with evolving individuals that are allowed to be created
randomly, to mutate, to inherit their qualities, etc. GAs are useful in optimization problems with a large number of discrete
solutions.

Typically, a High Energy Physics (HEP) analysis has quite a few selection criteria (cuts) to optimize, for example
to maximize the significance of the "signal" excess over "background" events: cuts on transverse energy/momenta, missing
transverse energy, angular correlations, isolation and impact parameters, etc. In such cases a simple scan over the
multi-dimensional cut space (especially when done on top of a scan over a space of theoretical parameters, e.g.
for SUSY) leads to CPU time demands varying from days to many years. One of the alternative methods which
solves the issue is to employ a Genetic Algorithm, see e.g. 

We wrote a code, GARCON [5], which automatically performs cut optimization and verification of stability,
effectively trying $\sim 10^{50}$ permutations of cut parameter values for millions of input events on a time scale of hours.
Examples of analyses with GARCON can be found in the CMS Physics TDR, v.2 [6], and in recent
papers [7][8][9].

In comparison with other automated optimization methods, GARCON output is transparent to the user: it simply reports
which rectangular cut values are optimal and recommended for an analysis. The interpretation of these cut values is
exactly the same as when one selects a set of rectangular cut values for each variable in a "classical" way "by
eye", except that with GARCON the cut values are optimal, delivering the best value of the function
used for optimization 1).

In this paper we describe the basics of the GA, illustrating GARCON functionality on a simple example of a "toy"
MC generator-level analysis. A significant part of the paper consists of a User's Manual describing how to use the
program (Sec. 5).

GARCON version 2.0 [5], among many other features, allows the user:

• to select an optimization function among known significance estimators, as well as to define the user's own
criteria, which may be as simple as a signal to background ratio, or more complicated, including different
systematic uncertainties separately on signal and background processes, different weights per event, etc.;

• to define the precision of the optimization;

• to restrict the optimization using different kinds of requirements, such as the minimum number of signal/background
events to survive after final cuts, the variables/processes to be used for a particular optimization run, and the number of
optimizations inside one run, to ensure that the optimization converges and finds not just local maxima but the
global one as well (in case of a complicated phase space);

• to automatically verify the stability of results.

This paper has the following structure:

• Section 2 describes details of a "toy"-study example, 

• Section 3 shows a simple example of a "classical", eye-balling approach analysis for cuts optimization, 

• Section 4 gives details on the GARCON (GA) approach to cuts optimization and contains a comparison of these
two approaches,

• Section 5 is a detailed how-to user's manual.

The chosen "toy study" is on purpose a simple Monte Carlo (MC) analysis, to illustrate GARCON functionality in
a clear and transparent way. Much more sophisticated use-cases of the program can be found elsewhere [7][8][9].

1) Hard-coded popular significance estimators, as well as the possibility of a user-defined function, are described in Appendix C.

2 LM6 with PYTHIA: a Toy Study

We are working in the framework of the mSUGRA model [10], which is derived from the more general MSSM [11]
using constraints inspired by supergravity unification. In the case of mSUGRA, the number of independent MSSM
parameters is reduced to just five. For our illustration we selected a point in the mSUGRA parameter space with the
following parameter values:

• the universal gaugino mass $m_{1/2} = 400$ GeV,

• the scalar mass $m_0 = 85$ GeV,

• the trilinear soft supersymmetry-breaking parameter $A_0 = 0$,

• the ratio of Higgs vacuum expectation values, $\tan\beta = 10$,

• the sign of the Higgsino mixing parameter, $\mathrm{sign}(\mu) > 0$.

Characteristic features of SUSY events, following from a consideration of the signal Feynman diagrams, are large
MET (mainly due to massive stable SUSY particles, LSPs) and large jet $E_T$ (due to cascade
decays of heavy SUSY particles).

Background processes considered in this study are QCD, W/Z+jets, double weak-boson production and $t\bar{t}$.

The main generation tool is PYTHIA 6.227 [12]. In addition, ISASUGRA, part of ISAJET 7.69 [13], is linked to
PYTHIA to provide mSUGRA masses, couplings and branchings for the signal simulation.

All simulations and analyses are done for an integrated luminosity of 10 fb$^{-1}$.

2.1 Parameters of generation 

The PYTHIA parameters for all generated processes are those defining the underlying-event physics. They are specially
tuned for the LHC, are used by both the ATLAS and CMS collaborations, and can be found
elsewhere [14]. Processes with large cross sections are simulated in certain intervals of $p_T$.
The simulated processes and their main characteristics are listed in Table 1.

2.2 Variables and preselection 

Several variables characterizing the event were stored in the ASCII files: 

• number of muons,

• the highest muon $p_T$ ($p_T^1$),

• isolation parameter for the highest-$p_T$ muon 2) ($ISOL^1_{\mu}$),

• number of jets with $p_T > 40$ GeV ($N_j$),

• $E_T$ of the highest-$E_T$ jet ($E_T^1$),

• $E_T$ of the third-highest-$E_T$ jet ($E_T^3$),

• missing transverse energy ($E_T^{miss}$),

• azimuthal angle between the highest-$p_T$ muon (if any) and $E_T^{miss}$ ($\Delta\phi(\mu^1, E_T^{miss})$),

• azimuthal angle between the highest-$E_T$ jet and $E_T^{miss}$ ($\Delta\phi(jet^1, E_T^{miss})$),

2) $ISOL = \sum p_T$ ($p_T$ with respect to the beam direction) should be less than or equal to 0, 0, 1, 2 GeV for the four muons when
the muons are sorted by the ISOL parameter. The sum runs only over charged particle tracks with $p_T$ greater than 0.8 GeV
and inside a cone of radius $R = \sqrt{(\Delta\phi)^2 + (\Delta\eta)^2} = 0.3$ in the azimuth-pseudorapidity space. A $p_T$ threshold of 0.8 GeV
roughly corresponds to the $p_T$ at which tracks start looping inside the CMS Tracker. Muon tracks are not included in the
calculation of the ISOL parameter.

Table 1: Data samples and their parameters.

 #    PYTHIA process id   Process        p_T (GeV/c)   Cross section* (pb)   N_generated   N_expected / N_generated
 1    39                  mSUGRA         no limits     4                     10^6          4·10^-2
 2    16,31               W + jet        20-50         3.1·10^4              9·10^6        35.4
 3    "                   "              50-100        7.9·10^3              9·10^6        8.8
 4    "                   "              100-200       1.5·10^3              7.02·10^6     2.1
 5    "                   "              200-400       1.4·10^2              1.44·10^6     0.98
 6    "                   "              > 400         8.3                   8.5·10^4      0.98
 7    15,30               Z + jet        20-50         1.2·10^4              3·10^6        38.2
 8    "                   "              50-100        3.0·10^3              3·10^6        9.9
 9    "                   "              100-200       6.0·10^2              3·10^6        2.0
 10   "                   "              > 200         6.4·10^1              3·10^6        0.21
 11   81,82               $t\bar{t}$     no limits     840                   8.0·10^6      0.95
 12   22,23,25            ZZ, WZ, WW                   1.4                   1.0·10^6      1.4·10^-2
 13                       QCD            200-400       6.1·10^4              5·10^7        12.3
 14                       "              400-800       2.1·10^3              5·10^6        4.2
 15                       "              > 800         4.8·10^1              1.5·10^5      0.32

* For $t\bar{t}$, the NLO cross section is assumed [15]; for all other processes PYTHIA (LO) cross sections are taken.
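The last column of Table 1 can be cross-checked with a few lines of code. The sketch below assumes that N_expected is simply the cross section times the integrated luminosity of 10 fb^-1 used in this study; the helper name weight_per_event is hypothetical and is not part of GARCON.

```python
# Hypothetical helper (not part of GARCON): expected events per generated event.
PB_PER_FB = 1000.0  # 1 fb^-1 corresponds to 1000 pb^-1


def weight_per_event(cross_section_pb, luminosity_fb, n_generated):
    """Return sigma * L / N_generated, with sigma in pb and L in fb^-1."""
    n_expected = cross_section_pb * luminosity_fb * PB_PER_FB
    return n_expected / n_generated


# W + jet, 50-100 GeV/c bin of Table 1: 7.9e3 pb, 9e6 generated events.
w_wjet = weight_per_event(7.9e3, 10.0, 9e6)
# mSUGRA signal: 4 pb, 1e6 generated events.
w_signal = weight_per_event(4.0, 10.0, 1e6)
print(f"W+jet (50-100): {w_wjet:.2f}, mSUGRA: {w_signal:.3f}")
```

With these inputs the two values agree with the corresponding entries of Table 1 (8.8 and 4·10^-2) after rounding.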



• circularity, $Circ = 2 \cdot \min(\lambda_1, \lambda_2) / (\lambda_1 + \lambda_2)$, where $\lambda_1, \lambda_2$ are the eigenvalues of a simple matrix
$C^{\alpha\beta} = \sum E^{\alpha} E^{\beta}$, where $\sum$ means a sum over the energies of all objects (leptons, jets, missing energy) and $\alpha, \beta = 1, 2$
correspond to the x and y components. In the case of back-to-back di-jets Circ is close to 0, while in the case of
a multi-jet topology Circ tends to be closer to 1.
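The circularity definition above can be illustrated with a short sketch. It assumes that every object (lepton, jet, missing energy) is given by the x and y components of its transverse energy vector; the function name and the toy events are made up for illustration. The eigenvalues of the symmetric 2x2 matrix are obtained from its trace and determinant.

```python
import math


def circularity(objects):
    """Circ = 2*min(l1, l2)/(l1 + l2) for a list of (Ex, Ey) transverse vectors."""
    cxx = sum(ex * ex for ex, ey in objects)
    cyy = sum(ey * ey for ex, ey in objects)
    cxy = sum(ex * ey for ex, ey in objects)
    trace = cxx + cyy  # equals l1 + l2
    if trace == 0.0:
        raise ValueError("no transverse energy in the event")
    det = cxx * cyy - cxy * cxy  # equals l1 * l2
    # Eigenvalues of a symmetric 2x2 matrix: (trace +- sqrt(trace^2 - 4 det)) / 2
    root = math.sqrt(max(trace * trace - 4.0 * det, 0.0))
    lam_min = 0.5 * (trace - root)
    return 2.0 * lam_min / trace


# Toy events: two back-to-back objects, and three objects 120 degrees apart.
back_to_back = [(100.0, 0.0), (-100.0, 0.0)]
three_fold = [(100.0 * math.cos(a), 100.0 * math.sin(a))
              for a in (0.0, 2.0 * math.pi / 3.0, 4.0 * math.pi / 3.0)]
print(circularity(back_to_back), circularity(three_fold))
```

The back-to-back configuration gives a circularity of 0 and the symmetric three-object configuration gives a value of 1 up to rounding, matching the two limits quoted in the text.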

Jets are reconstructed using a cone algorithm with merging-splitting of overlapping clusters. In order to reduce the
number of events in the data files, a minimal $E_T^{miss}$ cut of 50 GeV is applied at generator level; it is known
to be non-biasing, as typical off-line cuts on $E_T^{miss}$ are significantly higher [16]. Further preselection requirements are
at least two jets above 40 GeV in every event and a leading jet $E_T$
above 200 GeV. The latter results from the fact that it does not look possible to simulate an appropriate number of
QCD events with $p_T < 200$ GeV/c.

2.3 Significance estimator 

The $S_{c12}$ significance estimator [17] was used for optimization: $S_{c12} = 2 \cdot (\sqrt{S + B} - \sqrt{B})$, where $B$ is the number
of all background events after cuts and $S$ is the number of signal events after cuts. Results are also presented in
terms of $S_{cL} = \sqrt{2 \cdot (S + B) \cdot \log(1 + S/B) - 2 \cdot S}$, which follows the true Poisson probability for small numbers
of events better than $S_{c12}$, as shown in Ref. [5].
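The two estimators are easy to evaluate numerically. The sketch below is only an illustration (the function names are ours, and the logarithm is taken to be the natural one); the inputs are the final signal and background event numbers quoted in Table 3.

```python
import math


def s_c12(s, b):
    """S_c12 = 2 * (sqrt(S + B) - sqrt(B))."""
    return 2.0 * (math.sqrt(s + b) - math.sqrt(b))


def s_cl(s, b):
    """S_cL = sqrt(2 * (S + B) * log(1 + S/B) - 2 * S); requires B > 0."""
    return math.sqrt(2.0 * ((s + b) * math.log(1.0 + s / b) - s))


# Event numbers after cuts from Table 3: (S, B) for the two optimization columns.
for label, s, b in (("classical", 665.0, 6496.0), ("GARCON", 574.0, 1130.0)):
    print(f"{label}: S_c12 = {s_c12(s, b):.2f}, S_cL = {s_cl(s, b):.2f}")
```

Within rounding of the inputs, the values reproduce the significances listed in Table 3 (about 8 for the classical cuts and about 15 for the GARCON cuts).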

2.4 Splitting statistics in two parts 

We divided the statistics into two parts: cut optimization is performed on one of them, and the stability of the results is
then verified on the other. This is especially important for analyses with limited statistics: in such cases one risks optimizing
the cuts around a statistical fluke in the signal over background significance. The "blind experiment" verification approach
allows one to exclude such unstable cases.
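A minimal sketch of such a split is given below. It assumes a random division of the events of one sample into two halves and rescales the per-event weight so that each half still corresponds to the full integrated luminosity; the function name, the seed and the toy sample are hypothetical.

```python
import random


def split_in_two(events, weight, seed=12345):
    """Randomly split events into two parts and rescale the per-event weight."""
    shuffled = list(events)
    random.Random(seed).shuffle(shuffled)
    half = len(shuffled) // 2
    optimization, verification = shuffled[:half], shuffled[half:]
    # Each part must still represent the full integrated luminosity.
    w_opt = weight * len(shuffled) / len(optimization)
    w_ver = weight * len(shuffled) / len(verification)
    return (optimization, w_opt), (verification, w_ver)


events = list(range(1000))  # stand-in for 1000 generated events
(opt, w_opt), (ver, w_ver) = split_in_two(events, weight=0.04)
print(len(opt), w_opt, len(ver), w_ver)
```

For an even split the weight simply doubles, so the expected number of events represented by each half stays the same as for the full sample.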

Figure 1: Number of jets with $E_T > 40$ GeV. Solid lines denote the SUSY signal, while dashed lines denote the sum of
the SM background distributions. The empty arrow is for the classical analysis cut choice; the filled colored arrows
(black and gray/yellow) are GARCON optimized cuts (values for the verification step).

Figure 2: Distribution of missing transverse energy. The same notations as for Fig. 1.

Figure 3: Transverse energy of the hardest-$E_T$ jet. The same notations as for Fig. 1.

Figure 4: Transverse energy of the third-hardest-$E_T$ jet. The same notations as for Fig. 1.



3 Classical Search 

3.1 Distributions and eye-balling search for cuts 

Figures 1-6 show some of the simulated data distributions which are used in the current analysis. Solid lines
denote the SUSY signal, while dashed lines denote the sum of the SM background distributions.

Figure 5: Azimuthal angle distance between the leading jet and transverse missing energy. The same notations as for Fig. 1.

Figure 6: Distribution of the circularity. The same notations as for Fig. 1.

4 GARCON Analysis 
4.1 Evolution algorithms 

GARCON, like GA-based programs in general, exploits evolution-type algorithms and uses evolution-like terms:



• Individual is a set of qualities which are to be optimized in a particular environment or set of requirements.
In the HEP analysis case, an Individual is a set of lower and upper rectangular cut values for each of the variables
under study/optimization.

• Environment, or the set of requirements of the evolutionary process, in the HEP analysis case is a Quality Function (QF)
used for optimization of individuals. Significance of a signal over background is another term widely used
in the HEP community for the QF. The higher the QF value, the better the Individual. For a HEP analysis the Quality
Function may be as simple as $S/\sqrt{B}$, where $S$ is the number of signal events and $B$ is the total number of
background events after cuts, or of almost any degree of complexity, including systematic uncertainties on
different backgrounds, etc. 3)

• A given number of individuals constitute a Community, which is involved in the evolution process.

• Each individual is involved in the evolution, i.e. in breeding of new individuals with a possibility of mutation,
death, etc. The higher the QF of a particular individual, the more chances this individual has to participate
in breeding of new individuals and the longer it lives (participates in more breeding cycles, etc.), thus
improving the community as a whole.

• Breeding in the HEP analysis example is the production of a new individual with qualities taken in a defined way
from two "parent" individuals.

• Death of an individual happens when it passes the age limit for its quality: the higher its quality, the
longer it lives.

• Cataclysmic Updates may happen after a long period of stagnation in evolution; at this time
the whole community gets renewed and gets another chance to evolve to an even better quality level. In the HEP
analysis case this corresponds to a chance to find another local, and ultimately the global, maximum of the
quality function. Obviously, the more complicated the phase space of cut variables, the greater the chance
that there are several local maxima in the quality function optimization.

• There are some other algorithms involved in GAs, for example mutation of a new individual. In this case a
"new-born" individual has not just the qualities of its "parents" but also some variations, which in terms of the HEP
analysis example helps evolution to find a global maximum, with less chance of falling into a local one. There
are also random creation mechanisms serving the same purpose, etc.

3) GARCON allows a user-defined QF.
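To make the terms Individual and Quality Function concrete, the sketch below represents an Individual as a list of (min, max) cut pairs and evaluates the simple QF $S/\sqrt{B}$ on weighted toy samples. The data layout, the function names and the choice to return 0 when no background survives are assumptions made for illustration; they do not describe GARCON internals.

```python
import math


def passes(event, individual):
    """An Individual is a list of (min, max) cuts, one pair per variable."""
    return all(lo <= x <= hi for x, (lo, hi) in zip(event, individual))


def quality(individual, signal, backgrounds):
    """Quality function S/sqrt(B) with weighted event counts after cuts."""
    sig_events, sig_weight = signal
    s = sig_weight * sum(passes(e, individual) for e in sig_events)
    b = sum(w * sum(passes(e, individual) for e in evs) for evs, w in backgrounds)
    # Toy convention (an assumption): quality is 0 if no background survives.
    return s / math.sqrt(b) if b > 0.0 else 0.0


# Toy input: events are (met, njet) tuples; each sample has one weight per event.
signal = ([(400.0, 5), (350.0, 4), (120.0, 3)], 1.0)
backgrounds = [([(90.0, 2), (150.0, 3), (360.0, 4)], 4.0)]
loose = [(0.0, math.inf), (0, 16)]
tight = [(300.0, math.inf), (4, 16)]
print(quality(loose, signal, backgrounds), quality(tight, signal, backgrounds))
```

In this toy case the tighter cut set loses one signal event but removes most of the background, so its quality is higher than that of the loose set.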

4.2 Input for GARCON 

GARCON uses the same input information as a classical analysis: arrays of variable values (see Appendix A), the
same as is needed to perform a classical eye-balling cut optimization.

Details on the chosen variables are given in Sec. 2.2.

4.3 Optimization 

Each cycle ("year") of evolution includes a community update, that is the breeding process, possible mutation of new
individuals, quality and age calculations for each individual, death of the worst and too old individuals, etc.

As described above, the better an individual's QF, the longer it lives and hence the more chances it has to produce
new individuals, improving the quality of the community as a whole and, as a final goal, the quality of the very best
individual. This very best individual, i.e. the very best set of min/max cut values, which corresponds to the best achievable
quality function (significance of signal over background), is the final output of the GARCON optimization
step: the rectangular cut values recommended by the optimization procedure.
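The yearly cycle described above can be sketched as a toy genetic algorithm. This is a deliberately simplified illustration and not the GARCON implementation: age limits, death and cataclysmic updates are omitted, the quality function is $S/\sqrt{B}$ on unweighted toy events, and all names, population sizes and mutation rates are assumptions.

```python
import math
import random


def quality(ind, signal, background):
    """Toy QF: S/sqrt(B) after rectangular cuts (0 if no background survives)."""
    def count(events):
        return sum(all(lo <= x <= hi for x, (lo, hi) in zip(e, ind)) for e in events)
    s, b = count(signal), count(background)
    return s / math.sqrt(b) if b > 0 else 0.0


def random_individual(grid, rng):
    """Draw an ordered (min, max) pair from the allowed cut values of each variable."""
    return [tuple(sorted(rng.sample(values, 2))) for values in grid]


def breed(mother, father, grid, rng, mutation=0.2):
    """Inherit each variable's cut pair from one parent, with occasional mutation."""
    child = [rng.choice(pair) for pair in zip(mother, father)]
    for i, values in enumerate(grid):
        if rng.random() < mutation:
            child[i] = tuple(sorted(rng.sample(values, 2)))
    return child


def evolve(signal, background, grid, size=30, n_best=5, years=30, seed=1):
    rng = random.Random(seed)
    qf = lambda ind: quality(ind, signal, background)
    community = [random_individual(grid, rng) for _ in range(size)]
    history = []
    for _ in range(years):
        community.sort(key=qf, reverse=True)
        best = community[:n_best]  # the best individuals survive unchanged
        history.append(qf(best[0]))
        children = [breed(rng.choice(best), rng.choice(community), grid, rng)
                    for _ in range(size - n_best)]
        community = best + children
    return max(community, key=qf), history


# Toy samples with two variables; signal sits at higher values than background.
data_rng = random.Random(0)
signal = [(data_rng.gauss(300, 60), data_rng.gauss(250, 50)) for _ in range(200)]
background = [(data_rng.gauss(150, 60), data_rng.gauss(150, 50)) for _ in range(2000)]
grid = [list(range(0, 501, 25))] * 2  # allowed cut values per variable
best, history = evolve(signal, background, grid)
print(best, round(history[0], 2), round(history[-1], 2))
```

Because the best individuals are carried over unchanged each year, the best quality recorded in history can never decrease; mutation and breeding with randomly chosen partners supply the variation that lets the community move away from a local maximum.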

Figures 7, 8, 9 and 10 show the evolution of the $S_{c12}$ quality function, the evolution of the MET and circularity cut
values, and the amount of time used for optimization.

A typical optimization procedure with GARCON takes from a few seconds to several hours, depending on the amount
of statistics and additional requirements such as the minimal number of events to survive after all cuts, etc. As one can
see from Fig. 10, results close to the best are already achieved before the first cataclysmic update, which happened
at year < 50 and required less than 3.5 hours of CPU time for 10 variables (Sec. 2.2), or 20 optimized parameters,
with a precision of 2.5% on each and about $4 \cdot 10^5$ generated events on input (after pre-selection, see Sec. 2.2).

Optimized values for all the cut parameters are listed in Table 2. Results in terms of the chosen significance estimator,
as well as the signal to background ratio and the final event numbers, are listed in Table 3. Cuts are also
illustrated on the cut parameter distributions in Figs. 1-6.

Table 2: Min and max values for cut parameters. Cut values for the classical analysis are the same for the optimization
and verification steps. Cut values for GARCON verification are rounded off in comparison to those obtained from
optimization, to reflect resolution effects and possible lower/upper limits.

 cut parameter                               classical    GARCON optimization   GARCON verification
 $N_{mu}$                                    0-inf        0-5                   0-5
 $p_T^1$, GeV                                0-inf        0-1020                0-inf
 $ISOL^1_{mu}$, GeV                          0-inf        0-1080                0-inf
 $N_j$                                       4-16         2-16                  2-16
 $E_T^1$, GeV                                300-inf      200-2220              200-inf
 $E_T^3$, GeV                                50-inf       0-901                 0-inf
 $E_T^{miss}$, GeV                           200-inf      342-2150              340-inf
 $\Delta\phi(\mu^1, E_T^{miss})$, rad        0-π          0.297-π               0.297-π
 $\Delta\phi(jet^1, E_T^{miss})$, rad        0.262-π      0.245-π               0.245-π
 circularity                                 0.06-1       0.0924-0.993          0.0924-1


Analysing the cut values and their distributions (Figs. 1-6), one can see that some variables after GARCON optimization
converge to the limits of a particular distribution. From the technical point of view, the reason is
that GARCON works only with input values and does not have, e.g., plus or minus infinity at its disposal. From
the practical point of view, it means that the min or max cut on a particular variable, or the whole variable, is not useful
in comparison to other variables in terms of improving signal to background significance, and GARCON shows it.
As an example we can consider $E_T^1$ and $E_T^3$ before and after all cuts (except the cut on $E_T^1$ or $E_T^3$, correspondingly),
examples of variables for which GARCON and classical cut values are different: compare Figs. 3 and 4 for the
distributions before and Figs. 11 and 12 after the cuts are applied.

Figure 7: Evolution of the cuts on MET. Upper and lower curves are for min and max cut
values on the variable. Vertical dotted-dashed lines show cataclysmic update times; the right
one corresponds to a cataclysmic update after the best result was achieved.

Figure 8: Evolution of the cuts on CIRC. Notations are the same as for Fig. 7.

4.4 Verification 

As mentioned earlier, the available MC statistics was divided into two parts. The second part is used for a "blind"
analysis, i.e. verification of results stability.

After we obtain the cut values from the optimization step, we round them off to the level of expected precision 4) for each
parameter (see Table 2) and apply them to the second half of the statistics.

Results are shown in Table 3. One can see that the results are stable 5).



Table 3: Comparison of final results for the classical and GARCON approaches (for the optimization and verification steps)
in terms of the $S_{c12}$ and $S_{cL}$ significance estimators, the ratio of the final number of signal to total background
events, and those numbers of events with the MC statistical error included.

 parameter          classical optimization   classical verification   GARCON optimization   GARCON verification
 $S_{c12}$          8.1                      8.0                      15.3                  14.7
 $S_{cL}$           8.1                      8.1                      15.8                  15.2
 S/B                0.102                    0.102                    0.506                 0.469
 $N_{signal}$       665 ± 7                  663 ± 7                  574 ± 7               567 ± 7
 $N_{background}$   6496 ± 160               6503 ± 160               1130 ± 121            1210 ± 121


4.5 Comparison between classical and GARCON approaches 

The difference in performance in terms of significance (8 vs. 15) and signal to background ratio (0.1 vs.
0.5) may not be a typical gain when GARCON is used instead of a classical approach: a classical approach may be quite
sophisticated (and the time dedicated to it may be large). What is important to emphasize is that GARCON
performs optimization and verification of results stability automatically, without requiring any special treatment
of either input data or output results, and converges to virtually the best set of cuts, typically within hours.

Different cases which show GARCON usage in much more complicated analyses can be found elsewhere [7][8][9].

4) Expected precision, which includes detector resolution, is of course different for different parameters (muon $p_T$, jet $E_T$)
and different HEP experiments.

5) NOTE: in case there are zero generated events left after final cuts we use 0 ± 1 generated events, taking a slightly pessimistic
estimate of the MC statistical error and hence of the corresponding number of expected events: ± one generated-event weight.

Figure 9: Dynamics of the significance estimator ($S_{c12}$) value. Notations are the same as for Fig. 7.

Figure 10: Amount of time spent for evolution. Notations are the same as for Fig. 7.

Figure 11: Transverse energy of the hardest-$E_T$ jet.

Figure 12: Transverse energy of the third-hardest-$E_T$ jet. The same notations as for Fig. 11.



5 GARCON v2.0: User's Manual. 

5.1 Introduction 

This is a user's manual on how to run GARCON (GA program v2.0) and use its output in physics analyses. The
program is designed for rectangular cuts optimization (with the same interpretation and usage of cut values as you
would expect from a classical, eye-balling selection of rectangular cut values).

You may find the following presentations, linked to the GARCON home page, useful:

• "GARCON - Genetic Algorithm for Rectangular Cuts OptimizatioN" - this is a how-to use the program talk 

• "Genetic Algorithm studies and comparisons using different significance estimators" 

• "Genetic Algorithm and example application in a SUSY search" 

5.2 Installation 

For now, to make users' and developers' lives easier, the code is available as a static library pre-compiled on Scientific
Linux PCs (CERN's Linux Red Hat 9.0 version). It has a template for the user to define an optimization function (several
widely-used functions described below are also available). All you need is to get an archive from the following
web page (look for the "Code" link there):

• go to: http://drozdets.home.cern.ch/drozdets/home/genetic/

• download the most recent version of the program (garcon-2_0.tar.gz),

• do 'tar -xzf garcon-2_0.tar.gz' in any directory where you would like to work with GA,

• read and follow the README file instructions on how to build the executable and run GARCON.

You will find there the following files/directories: 

• lib/libgenetic.a - is a GARCON library file. 

• quality.cc - C++ template file for you to define your own quality/significance function for optimization. 
(Look inside this file, it has two examples of user's functions with detailed comments: simple and detailed 
ones. You should be able to easily construct your own function in a similar way.) 

• Makefile - for making binary file: garcon-ga. 

• data/ - a directory, where you can store your input data files (see example of their format below). 

• dataFiles.dat, initialization.dat, verification.dat, dataErrors.dat, variablesON_OFF.dat - files with input
parameters (description is given below).

5.3 How to run GA. 

Just use './garcon-ga > output.txt' after you have prepared an executable following the instructions in the README file
and put appropriate parameters into the dataFiles.dat, initialization.dat, verification.dat and dataErrors.dat files.

5.4 Input data sample files format.

An example is in data/example.dat, see also Appendix A. NOTE: This file is just an example of the format; it is a shortened
version of a real data file.

The first line is for a process name (one word). In the example it is: 1ml.

The second line is for the weight per event in the sample (usually calculated from the cross section and the integrated
luminosity as the number of expected events per generated event, cf. the last column of Table 1). Do not
forget to increase the weights correspondingly when you divide your statistics into two parts, one for the optimization and
one for the verification step. In the example it is: 0.324. One can also vary the weight with the weightCoef parameter described
in Sec. 5.6.

The third line is for variable/cut names (one word for each). In the example file there are 12 of them: met njet et_jet1
et_jet2 eta_lep1 eta_lep2 deltaR cosTheta dphi_l1met dphi_j1met njet30 njet50.

The fourth and following lines are for input data: the values of the above listed variables for each event of a particular
process. (One input file per process! One line per event!) See the example in the file.
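The file layout described above can be read with a few lines of code, for example to check a sample before handing it to GARCON. The sketch below is a hypothetical reader, not part of the GARCON distribution; it assumes whitespace-separated values, and the process name and numbers in the in-memory example are invented.

```python
import io


def read_sample(stream):
    """Parse one process file: name, weight, variable names, then one event per line."""
    lines = [line.strip() for line in stream if line.strip()]
    name = lines[0]
    weight = float(lines[1])
    variables = lines[2].split()
    events = [[float(x) for x in line.split()] for line in lines[3:]]
    for event in events:
        if len(event) != len(variables):
            raise ValueError("event line does not match the variable list")
    return name, weight, variables, events


# Invented three-variable example in the format described above.
example = io.StringIO("""ttbar
0.324
met njet et_jet1
212.5 4 310.0
95.1 2 205.7
""")
name, weight, variables, events = read_sample(example)
print(name, weight, variables, len(events))
```

The same function can be applied to a real file by passing an open file object instead of the in-memory string.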

5.5 Input parameters. File dataFiles.dat. 

This file in each line has a PATH to a particular input data sample. The paths may be relative (relative with respect 
to the directory where you run GA) or absolute (always works). 

Example is in: dataFiles.dat release file version. 

NOTE: input file for signal is always the last one! 

You will need two such files with two lists of input files if you are to perform optimization of cuts and verification 
of results stability. 

5.6 Input parameters. File initialization.dat. 

This file has input parameters for GA, see Appendix B.

• internal - the first three service parameters. Just set them to 0; they are not used in the public version and exist
for debugging purposes.

• Integer maxNumberFeatures - the number of cut parameters used in your analysis (with values provided in
the input files, of course). It equals 12 in the input data sample file format example described above.

• Integer maxNumberProcesses - should be equal to the number of input files. It equals 9 for the release version of
dataFiles.dat.

• Integer maxNumberFeatureValues - this tells GA how precise you want your cuts to be, i.e. how small a step to
use. For example, if you set maxNumberFeatureValues to 40, each step from one cut value to the next cuts away 2.5%
(100/40) of events. (Signal distributions are used to define cut values. Steps are not
equal; they depend on the event density of each distribution.)

• Integer initNumberPopulation, greater than 1 - the number of different cut sets (Individuals)
involved in evolution. It is better to avoid settings that are too small (<< 30) or too big (>> 500). The default value of 100
is a reasonable choice.

• Integer maxNumberBestIndivids, greater than 0 and smaller than initNumberPopulation - should be much
smaller than initNumberPopulation. This is the number of the best cut sets, i.e. the cut sets which get
priority in GA iteration steps. (There is always one very best cut set, which is printed in full detail.)
The default is 5 (for an initNumberPopulation of 100).

• Integer ageLimit, greater than 0 and smaller than yearsForEvolution - how long each cut set (Individual)
will be involved in evolution iterations. ageLimit is also the limit on how long the very best cut set may stay
unchanged before the whole population of cut sets is forced to get a new try (cataclysmic update). (This
"new try" or critical update allows GA to try to find another maximum in case there is more than one local
maximum of the significance in a given parameter space.) Values between 10 and 50 are good to try. The default is 10.

• Real mutationFactor, from 0 to 1 - technical. Sets the degree of randomness in the mutation process. The default is
0.8. Values between 0.01 and 1.00 can be tried.

• Integer yearsForEvolution, greater than 0 (less than 1000) - the number of iteration cycles for the whole
optimization. Several hundred are OK. (This number should be several times bigger than ageLimit.)

• Integer optimFlag, 1 or 0 - 1 if you do optimization, 0 if you perform verification of results. The first is to be
performed on one part of the statistics you have in the analysis and is for finding the best cut set. The second is
to verify the stability of the results using output from the optimization step.

• Integer SignificanceChoice, 0, 1, 2, 3, 4, 5 or 6:

- 0 is for $S_1 = S/\sqrt{B}$,

- 1 is for $S_2 = S/\sqrt{S + B}$,

- 2 is for $S_{c12} = 2 \cdot (\sqrt{S + B} - \sqrt{B})$,

- 3 is for $S_{cL} = \sqrt{2 \cdot (S + B) \cdot \log(1 + S/B) - 2 \cdot S}$,

- 4 is for $S/B$, where $B$ is the number of all background events after cuts and $S$ is the number of
signal events after cuts,

- 5 and 6 are for user-defined significance functions (5 is for a simple one and 6 allows the user to access
details for each event, see Appendix C).

Look at the "Genetic Algorithm studies and comparisons using different significance estimators" talk for a
hint on the stability of a particular significance estimator and the differences between them.

• Real minEventsCoefficientSignal and minEventsCoefficientBackground - set the minimal number of events which
should survive after cuts, calculated as $minEventsCoefficientBackground \cdot \sqrt{\sum_{i=1}^{N_{background\ processes}} weight_i^2}$
for background processes and $minEventsCoefficientSignal \cdot weight_{signal}$ for the signal process.
The default is 5. These parameters, or final event number thresholds, affect results stability, as they do in a classical
approach as well.

• Real weightCoef - a re-weighting coefficient. For example, if you have your samples prepared for 10 fb$^{-1}$
and would like to see how the results of optimization would change for other luminosities, for example for
100 fb$^{-1}$, you can simply set weightCoef = 10. This parameter is also useful when you divide your statistics
into two parts for optimization and verification and do not want to remember to change the weights in all the data
samples: you may just change weightCoef (if you divide the statistics half-by-half, you need to multiply the weight
for every sample by 2, or set weightCoef = 2, to have calculations done for the same integrated luminosity).

Example is in: initialization.dat release file version. 
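The remark that cut steps follow the signal event density (see maxNumberFeatureValues above) can be illustrated as follows. The sketch takes candidate cut values at equal-population quantiles of a signal distribution, so that each step removes roughly the same fraction of signal events. How GARCON actually builds its list of cut values is not specified in this note, so the function below is only an assumed, simplified picture with hypothetical names.

```python
def cut_values(signal_values, n_steps):
    """Return n_steps + 1 candidate cut values at equal-population quantiles."""
    ordered = sorted(signal_values)
    last = len(ordered) - 1
    return [ordered[round(i * last / n_steps)] for i in range(n_steps + 1)]


signal_met = [float(v) for v in range(1, 401)]  # toy signal distribution
grid = cut_values(signal_met, 40)               # steps of about 2.5% of events
print(len(grid), grid[0], grid[1], grid[-1])
```

For a strongly peaked distribution the resulting cut values are dense under the peak and sparse in the tails, which is the behaviour described in the parameter list.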

5.7 Input parameters for verification. File verification.dat. 

The first line is for the number of cut sets to be verified. In the example it is 4, see Appendix D.

Each of the best cut sets is listed in the output file after the optimization step. The very best one for each iteration is
printed in full detail with cut values, and there are some details on the maxNumberBestIndivids best cut sets (cut
values, age, etc.).

So, you may use the following procedure after the optimization step is done. Do 'grep Calculated output.txt'; you
will get a list of the best values of significance. You will likely see that there were several attempts by GA to find
the best optimization (significance/quality increases, then stays stable, then starts over again). So, by looking at the
output file you may find out which cut sets ('Min Individ Feature Values' and 'Max Individ Feature Values', the lower
and upper cut values) correspond to the few maxima you find in the 'grep Calculated output.txt' listing. Just cut
and paste them into verification.dat (two lines per cut set: 'Min Individ Feature Values' and 'Max Individ Feature
Values'). There are four such pairs in the example file.

Then run GA again, but with optimFlag set to 0. It is better to do this on a different part of the statistics ("blind" experiment)
and with cut parameter values rounded off to the level of the corresponding precision, to avoid "overtuning" of cuts.

5.8 Input parameters. File dataErrors.dat.

This file has a line of "penalty/priority" factors for each process, in the order in which they are listed in dataFiles.dat, see
Appendix E.

Example is in: dataErrors.dat release file version. 
NOTE: input file for signal is always the last one! 

The idea of the "penalty/priority" factors is to provide a simple estimate of the influence of systematic effects (the
alternative is to define a more sophisticated user QF, see Sec. 5.6 and Appendix C). If the factor values differ from
0.0, the Significance Functions described above are calculated with modified numbers of signal
and background events: $S \to S + k_S \cdot S$ and $B = \sum_i B_i \to B = \sum_i (B_i + k_{B_i} \cdot B_i)$, where
$k_S$ and $k_{B_i}$ are the factors given in dataErrors.dat.

HINT: setting a factor to -1.0 effectively switches off the corresponding process. (-1.0 is the minimal value for
a factor; there is no upper limit.)

HINT: setting $k_{B_i} \ge 0$ and $k_S \le 0$ gives a conservative estimate of the Significance (the opposite signs
give an optimistic one).

HINT: one may try these factors after optimization, during the "verification step", to get a feeling for different
scenarios.

5.9 Input parameters. File variablesON_OFF.dat. 

This file contains one line of ON/OFF switches, one per parameter, in the order in which the parameters are listed
in the input data files, see Appendix F. This allows the user to prepare input data files with an exhaustive list of
possible cut parameters and then perform optimization studies on different combinations of them.
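
How such a line of switches selects variables can be sketched as follows. The functions parseSwitches and activeVariables are hypothetical and only illustrate the idea; GARCON's own parser may treat the file differently (in particular, whether the switches are separated by spaces is an assumption, so the sketch accepts both forms).

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Parse a line of switches; accepts "110111111101" as well as "1 1 0 1 ...".
std::vector<bool> parseSwitches(const std::string& line)
{
    std::vector<bool> on;
    for (char c : line) {
        if (c == '1') on.push_back(true);
        else if (c == '0') on.push_back(false);
        // any other character (space, tab, line ending) is ignored
    }
    return on;
}

// Return the names of the variables that stay in the optimization.
std::vector<std::string> activeVariables(const std::vector<std::string>& names,
                                         const std::vector<bool>& on)
{
    std::vector<std::string> active;
    for (std::size_t i = 0; i < names.size() && i < on.size(); ++i)
        if (on[i]) active.push_back(names[i]);
    return active;
}
```

Applied to the line shown in Appendix F, parseSwitches returns twelve switches with the 3rd and 11th set to OFF.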

5.10 How to use GA results. 

Basically, the cut sets (pairs of 'Min Individ Feature Values' and 'Max Individ Feature Values' for each particular
cut/parameter) are used and interpreted in the same way as the classical, "eye-balled" cuts one usually looks for to
discriminate between signal and background events. So, once you have the GA results and have verified their stability,
you may go on with your analysis just as you would after finding a set of classical rectangular cuts for your distributions.

HINT: the very best result is repeated at the end of the output. 

HINT: the output contains details on the dynamics of all the available significance estimators (while the optimization
is performed on one of them).

HINT: you may want to use not the very best Individual (cut set) but, for example, one corresponding to a local
maximum with worse performance but better stability. As described above (Sec. 5.6), one way of making
results stable is to require a particular number of events to survive. Obviously, if the weight per generated event for
a particular process is something like 1000 expected "real" events and the cuts kill all the events in the MC sample, one
effectively has ± 1000 events expected, so a result with zero surviving events may be statistically unstable.
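
The stability argument can be turned into a simple check on the numbers of surviving generated events. The sketch below is not GARCON code: the function names and the threshold minGenEvents are hypothetical, and the uncertainty scale assumes equal weights per event within a process (weight times sqrt(n) for n survivors, and about one event weight when nothing survives, as in the ± 1000 example above).

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Rough scale of the MC statistical uncertainty on a weighted yield,
// assuming all generated events of the process carry the same weight.
double yieldUncertaintyScale(int nSurvived, double weight)
{
    if (nSurvived <= 0) return weight;   // zero survivors: about one event weight
    return weight * std::sqrt(static_cast<double>(nSurvived));
}

// Indices of processes with fewer than minGenEvents surviving generated events.
std::vector<std::size_t> lowStatProcesses(const std::vector<int>& genEvents,
                                          int minGenEvents)
{
    std::vector<std::size_t> flagged;
    for (std::size_t i = 0; i < genEvents.size(); ++i)
        if (genEvents[i] < minGenEvents) flagged.push_back(i);
    return flagged;
}
```

A cut set for which lowStatProcesses returns a heavily weighted background is a natural candidate to be replaced by a more stable local maximum.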

6 Summary 

We have presented the GARCON program and illustrated its functionality on a simple HEP analysis example; much more
complicated examples are described, for example, in the CMS Physics Technical Design Report. The program automatically
performs rectangular cuts optimization and verification for stability in a multi-dimensional phase space.

All in all, it is a simple yet powerful, ready-to-use, publicly available tool with a flexible and transparent setup
of optimization and verification parameters.

A Input data file format.

A part of one input data example file is given below. 



lml
.324
met njet et_jet1 et_jet2 eta_lep1 eta_lep2 deltaR cosTheta dphi_llmet dphi_jlmet njet30 njet50
383.926 2 550.591 150.944 -0.800319 -1.32351 0.864812 0.86366 0.191271 2.90746 2 2
229.268 3 248.019 103.515 0.930838 -1.82199 3.82118 -0.883346 0.772977 2.64147 2 2
149.199 6 374.811 142.887 -1.5395 -1.73921 1.44724 0.876463 2.00948 1.74833 5 3
266.147 5 369.581 131.862 1.49267 1.98226 0.950259 0.949556 0.6018 2.96775 3 2
360.203 3 242.108 158.631 -1.63074 -1.54017 2.37681 0.733466 2.4858 2.4815 2 2



B Initialization data format. 

An example of initialization input parameters file. 

internal 
internal 
internal 

maxNumberFeatures 12
maxNumberProcesses 9
maxNumberFeatureValues 40
initNumberPopulation 100
maxNumberBestIndivids 5
ageLimit 10
mutationFactor 0.8
yearsForEvolution 400
optimFlag 1
SignificanceChoice 3
minEventsCoefficientSignal 5.0
minEventsCoefficientBackground 5.
weightCoef 1.0

C User defined significance function. 

The whole text of the template is shown below. 



#include <cmath>
#include <iostream>
#include <vector>

using namespace std;

double UserDefinedQuality(const double S, const double B)
{
    // simple example of re-defined S1 = S/sqrt(B)
    // using total number of weighted signal (S)
    // and sum of background (B) events
    // (the small constant protects against division by zero)

    return S / sqrt(B + 0.00001);
}

double UserDefinedQuality(const double S,
                          const double B,
                          const double dS,
                          const double dB,
                          const vector<double> expEvents,
                          const vector<double> dexpEvents,
                          const vector<int> genEvents,
                          const vector<double> weights)
{
    // Detailed User's Quality function
    // available variables are:
    // S - total number of weighted signal events
    // dS - MC stat. error on total number of weighted signal events
    // B - total number of sum of weighted background events
    // dB - MC stat. error on total number of sum of weighted background events
    // expEvents - vector with weighted events (signal is the last element)
    // dexpEvents - vector with MC stat. error on weighted events (signal is the last element)
    // genEvents - corresponding numbers of generated events
    // weights - vector of weights per generated event

    // access example is demonstrated below
    if (0)
    {
        cout << "\n\nSignal events: " << S << " +- " << dS << endl;
        cout << "Background events: " << B << " +- " << dB << endl;
        for (int i = 0; i < int(expEvents.size()) - 1; i++)
        {
            cout << "Background Process " << i
                 << "\nWeighted background events " << expEvents[i]
                 << " +- " << dexpEvents[i]
                 << " corresponding to " << genEvents[i] << " gen. events"
                 << " with weight " << weights[i]
                 << endl;
        }

        int sig_index = expEvents.size() - 1;
        cout << "Signal Process:"
             << "\nWeighted events " << expEvents[sig_index]
             << " +- " << dexpEvents[sig_index]
             << " corresponding to " << genEvents[sig_index] << " gen. events"
             << " with weight " << weights[sig_index]
             << endl;
    }

    // simple example: signal over the square root of the first background only
    return S / sqrt(expEvents[0] + 0.0000001);
}
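
As a further illustration of what the detailed signature makes possible, the MC statistical error on the background, dB, could be folded into the denominator. The function below is only one possible choice and is not taken from GARCON; the name qualityWithBackgroundError and the formula S / sqrt(B + dB^2) are assumptions made for this sketch. Its body could be used as the return expression of the detailed UserDefinedQuality.

```cpp
#include <cmath>

// Hypothetical quality function: S / sqrt(B + dB^2).
// A large MC statistical error on the background lowers the quality,
// which disfavours cut sets that leave very few background MC events.
double qualityWithBackgroundError(double S, double B, double dB)
{
    const double variance = B + dB * dB;
    if (variance <= 0.0) return 0.0;   // nothing to normalize to
    return S / std::sqrt(variance);
}
```

With S = 10, B = 16 and dB = 3 the denominator is sqrt(25), giving a quality of 2.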

D Verification data format. 

An example of verification data format with four different cut sets to try (to verify). 

E Penalty factors.

An example of a "penalty" parameters input file. In this example the last sample, signal, gets -0.1 penalty, which 
means 10% reduction in the number of signal events. 

0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 -0.1 

F Switching variables ON/OFF file format, data errors file format. 

An example of the file that switches variables ON/OFF. In this example the 3rd and 11th variables are switched off,
i.e. excluded from the analysis.

110111111101 